하둡,spark
spark 데이터프레임 실습4
데이터왕
2024. 1. 24. 16:56
이탤릭체 텍스트
In [2]:
!pip install pyspark==3.3.1 py4j==0.10.9.5
Collecting pyspark==3.3.1 Downloading pyspark-3.3.1.tar.gz (281.4 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 4.3 MB/s eta 0:00:00 Preparing metadata (setup.py) ... done Collecting py4j==0.10.9.5 Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 kB 16.7 MB/s eta 0:00:00 Building wheels for collected packages: pyspark Building wheel for pyspark (setup.py) ... done Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845494 sha256=4a276e89899009f3020e1f7b2cf706363a821f382b050d4a969db134953fec17 Stored in directory: /root/.cache/pip/wheels/0f/f0/3d/517368b8ce80486e84f89f214e0a022554e4ee64969f46279b Successfully built pyspark Installing collected packages: py4j, pyspark Attempting uninstall: py4j Found existing installation: py4j 0.10.9.7 Uninstalling py4j-0.10.9.7: Successfully uninstalled py4j-0.10.9.7 Successfully installed py4j-0.10.9.5 pyspark-3.3.1
In [3]:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/usr/local/lib/python3.7/dist-packages/pyspark/jars/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar") \
.getOrCreate()
In [33]:
# 스텍오버플로 개발자 설문 다운
!wget https://s3-geospatial.s3-us-west-2.amazonaws.com/survey_results_public.csv
--2024-01-24 06:59:38-- https://s3-geospatial.s3-us-west-2.amazonaws.com/survey_results_public.csv Resolving s3-geospatial.s3-us-west-2.amazonaws.com (s3-geospatial.s3-us-west-2.amazonaws.com)... 52.92.148.194, 52.92.243.202, 52.92.208.154, ... Connecting to s3-geospatial.s3-us-west-2.amazonaws.com (s3-geospatial.s3-us-west-2.amazonaws.com)|52.92.148.194|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 81101949 (77M) [text/csv] Saving to: ‘survey_results_public.csv.1’ survey_results_publ 100%[===================>] 77.34M 27.8MB/s in 2.8s 2024-01-24 06:59:42 (27.8 MB/s) - ‘survey_results_public.csv.1’ saved [81101949/81101949]
In [5]:
!ls -tl
total 79212 drwxr-xr-x 1 root root 4096 Jan 19 14:20 sample_data -rw-r--r-- 1 root root 81101949 Jan 15 2023 survey_results_public.csv
In [6]:
!head -5 survey_results_public.csv
ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,YearsCodePro,DevType,OrgSize,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSys,NEWStuck,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,NEWOtherComms,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly 1,I am a developer by profession,"Independent contractor, freelancer, or self-employed",Slovakia,NA,NA,"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",18 - 24 years,"Coding Bootcamp;Other online resources (ex: videos, blogs, etc)",NA,NA,"Developer, mobile",20 to 99 employees,EUR European Euro,4800,Monthly,C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift,Swift,PostgreSQL;SQLite,SQLite,NA,NA,Laravel;Symfony,NA,NA,NA,NA,NA,PHPStorm;Xcode,Atom;Xcode,MacOS,Call a coworker or friend;Visit Stack Overflow;Go for a walk or other physical activity;Google it,Stack Overflow,Multiple times per day,Yes,A few times per month or weekly,"Yes, definitely",No,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268 2,I am a student who is learning to code,"Student, full-time",Netherlands,NA,NA,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc);School",7,NA,NA,NA,NA,NA,NA,JavaScript;Python,NA,PostgreSQL,NA,NA,NA,Angular;Flask;Vue.js,NA,Cordova,NA,Docker;Git;Yarn,Git,Android Studio;IntelliJ;Notepad++;PyCharm,NA,Windows,Visit Stack Overflow;Google it,Stack Overflow,Daily or almost daily,Yes,Daily or almost daily,"Yes, definitely",No,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,NA 3,"I am not primarily a developer, but I write code sometimes as part of my work","Student, full-time",Russian Federation,NA,NA,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc);Online Forum",NA,NA,NA,NA,NA,NA,NA,Assembly;C;Python;R;Rust,Julia;Python;Rust,SQLite,SQLite,Heroku,NA,Flask,Flask,NumPy;Pandas;TensorFlow;Torch/PyTorch,Keras;NumPy;Pandas;TensorFlow;Torch/PyTorch,NA,NA,IPython/Jupyter;PyCharm;RStudio;Sublime Text;Visual Studio Code,IPython/Jupyter;RStudio;Sublime Text;Visual Studio Code,MacOS,Visit Stack Overflow;Google it;Watch help / tutorial videos;Do other work and come back later,Stack Overflow;Stack Exchange,Multiple times per day,Yes,Multiple times per day,"Yes, definitely",Yes,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above,None of the above,Appropriate in length,Easy,NA 4,I am a developer by profession,Employed full-time,Austria,NA,NA,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",11 - 17 years,NA,NA,NA,"Developer, front-end",100 to 499 employees,EUR European Euro,NA,Monthly,JavaScript;TypeScript,JavaScript;TypeScript,NA,NA,NA,NA,Angular;jQuery,Angular;jQuery,NA,NA,NA,NA,NA,NA,Windows,Call a coworker or friend;Visit Stack Overflow;Go for a walk or other physical activity;Google it,Stack Overflow,Daily or almost daily,Yes,Daily or almost daily,Neutral,No,35-44 years old,Man,No,Straight / Heterosexual,White or of European descent,I am deaf / hard of hearing,NA,Appropriate in length,Neither easy nor difficult,NA
In [7]:
df = spark.read.csv("survey_results_public.csv", header=True).select('ResponseId', 'LanguageHaveWorkedWith', 'LanguageWantToWorkWith')
In [8]:
df.printSchema()
root |-- ResponseId: string (nullable = true) |-- LanguageHaveWorkedWith: string (nullable = true) |-- LanguageWantToWorkWith: string (nullable = true)
In [9]:
import pyspark.sql.functions as F
# LanguageHaveWorkedWith 값을 트림하고 ;를 가지고 나눠서 리스트의 형태로 language_have 필드로 설정
df2 = df.withColumn(
"language_have",
F.split(F.trim(F.col("LanguageHaveWorkedWith")), ";")
)
In [10]:
df2.show(5)
+----------+----------------------+----------------------+--------------------+ |ResponseId|LanguageHaveWorkedWith|LanguageWantToWorkWith| language_have| +----------+----------------------+----------------------+--------------------+ | 1| C++;HTML/CSS;Java...| Swift|[C++, HTML/CSS, J...| | 2| JavaScript;Python| NA|[JavaScript, Python]| | 3| Assembly;C;Python...| Julia;Python;Rust|[Assembly, C, Pyt...| | 4| JavaScript;TypeSc...| JavaScript;TypeSc...|[JavaScript, Type...| | 5| Bash/Shell;HTML/C...| Bash/Shell;HTML/C...|[Bash/Shell, HTML...| +----------+----------------------+----------------------+--------------------+ only showing top 5 rows
In [11]:
# LanguageWantToWorkWith 값을 트림하고 ;를 가지고 나눠서 리스트의 형태로 language_want 필드로 설정
df3 = df2.withColumn(
"language_want",
F.split(F.trim(F.col("LanguageWantToWorkWith")), ";")
)
In [12]:
df3.printSchema()
root |-- ResponseId: string (nullable = true) |-- LanguageHaveWorkedWith: string (nullable = true) |-- LanguageWantToWorkWith: string (nullable = true) |-- language_have: array (nullable = true) | |-- element: string (containsNull = false) |-- language_want: array (nullable = true) | |-- element: string (containsNull = false)
In [13]:
df3.show(5)
+----------+----------------------+----------------------+--------------------+--------------------+ |ResponseId|LanguageHaveWorkedWith|LanguageWantToWorkWith| language_have| language_want| +----------+----------------------+----------------------+--------------------+--------------------+ | 1| C++;HTML/CSS;Java...| Swift|[C++, HTML/CSS, J...| [Swift]| | 2| JavaScript;Python| NA|[JavaScript, Python]| [NA]| | 3| Assembly;C;Python...| Julia;Python;Rust|[Assembly, C, Pyt...|[Julia, Python, R...| | 4| JavaScript;TypeSc...| JavaScript;TypeSc...|[JavaScript, Type...|[JavaScript, Type...| | 5| Bash/Shell;HTML/C...| Bash/Shell;HTML/C...|[Bash/Shell, HTML...|[Bash/Shell, HTML...| +----------+----------------------+----------------------+--------------------+--------------------+ only showing top 5 rows
현재 많이 사용되는 언어들 찾기
In [14]:
# explode 함수는 배열을 풀어서 각 요소를 새로운 행으로 만들어주는 역할
# alias 함수를 사용하여 새로 생성된 컬럼의 이름을 "language_have"로 지정
df_language_have = df3.select(
df3.ResponseId,
F.explode(df3.language_have).alias("language_have")
)
In [15]:
df_language_have.show(10)
+----------+-------------+ |ResponseId|language_have| +----------+-------------+ | 1| C++| | 1| HTML/CSS| | 1| JavaScript| | 1| Objective-C| | 1| PHP| | 1| Swift| | 2| JavaScript| | 2| Python| | 3| Assembly| | 3| C| +----------+-------------+ only showing top 10 rows
In [16]:
df_language_have.groupby("language_have").count().show(10)
+-------------+-----+ |language_have|count| +-------------+-----+ | C#|22984| | VBA| 3847| | Rust| 5799| | Bash/Shell|22385| | JavaScript|53587| | NA| 1082| | Perl| 2028| | Erlang| 651| | Matlab| 3846| | Crystal| 466| +-------------+-----+ only showing top 10 rows
Sorting 두 가지 방법:
- sort & orderBy
- ascending & descending
In [17]:
# 'count' 컬럼을 기준으로 내림차순으로 정렬
df_language_have.groupby("language_have").count().sort(F.desc("count")).collect()
Out[17]:
[Row(language_have='JavaScript', count=53587), Row(language_have='HTML/CSS', count=46259), Row(language_have='Python', count=39792), Row(language_have='SQL', count=38835), Row(language_have='Java', count=29162), Row(language_have='Node.js', count=27975), Row(language_have='TypeScript', count=24909), Row(language_have='C#', count=22984), Row(language_have='Bash/Shell', count=22385), Row(language_have='C++', count=20057), Row(language_have='PHP', count=18130), Row(language_have='C', count=17329), Row(language_have='PowerShell', count=8871), Row(language_have='Go', count=7879), Row(language_have='Kotlin', count=6866), Row(language_have='Rust', count=5799), Row(language_have='Ruby', count=5569), Row(language_have='Dart', count=4965), Row(language_have='Assembly', count=4632), Row(language_have='Swift', count=4204), Row(language_have='R', count=4185), Row(language_have='VBA', count=3847), Row(language_have='Matlab', count=3846), Row(language_have='Groovy', count=2479), Row(language_have='Objective-C', count=2310), Row(language_have='Scala', count=2148), Row(language_have='Perl', count=2028), Row(language_have='Haskell', count=1749), Row(language_have='Delphi', count=1731), Row(language_have='Clojure', count=1552), Row(language_have='Elixir', count=1438), Row(language_have='LISP', count=1096), Row(language_have='NA', count=1082), Row(language_have='Julia', count=1068), Row(language_have='F#', count=804), Row(language_have='Erlang', count=651), Row(language_have='APL', count=536), Row(language_have='Crystal', count=466), Row(language_have='COBOL', count=437)]
In [18]:
# 정렬 두번째 방법
df_language_have.groupby("language_have").count().orderBy('count', ascending=False).collect()
Out[18]:
[Row(language_have='JavaScript', count=53587), Row(language_have='HTML/CSS', count=46259), Row(language_have='Python', count=39792), Row(language_have='SQL', count=38835), Row(language_have='Java', count=29162), Row(language_have='Node.js', count=27975), Row(language_have='TypeScript', count=24909), Row(language_have='C#', count=22984), Row(language_have='Bash/Shell', count=22385), Row(language_have='C++', count=20057), Row(language_have='PHP', count=18130), Row(language_have='C', count=17329), Row(language_have='PowerShell', count=8871), Row(language_have='Go', count=7879), Row(language_have='Kotlin', count=6866), Row(language_have='Rust', count=5799), Row(language_have='Ruby', count=5569), Row(language_have='Dart', count=4965), Row(language_have='Assembly', count=4632), Row(language_have='Swift', count=4204), Row(language_have='R', count=4185), Row(language_have='VBA', count=3847), Row(language_have='Matlab', count=3846), Row(language_have='Groovy', count=2479), Row(language_have='Objective-C', count=2310), Row(language_have='Scala', count=2148), Row(language_have='Perl', count=2028), Row(language_have='Haskell', count=1749), Row(language_have='Delphi', count=1731), Row(language_have='Clojure', count=1552), Row(language_have='Elixir', count=1438), Row(language_have='LISP', count=1096), Row(language_have='NA', count=1082), Row(language_have='Julia', count=1068), Row(language_have='F#', count=804), Row(language_have='Erlang', count=651), Row(language_have='APL', count=536), Row(language_have='Crystal', count=466), Row(language_have='COBOL', count=437)]
In [19]:
'''language_have 컬럼을 기준으로 그룹화하고, 각 그룹별로 레코드 수를 센 다음에 레코드 수를
기준으로 내림차순으로 정렬한 후 상위 50개만 선택하는 작업을 수행'''
df_language50_have = df_language_have.groupby("language_have")\
.count()\
.orderBy('count', ascending=False)\
.limit(50)
In [20]:
# 폴더가 있다면 덮어씀
df_language50_have.write.mode('overwrite').csv("language50_have")
In [21]:
!ls -tl
total 79216 drwxr-xr-x 2 root root 4096 Jan 24 06:34 language50_have drwxr-xr-x 1 root root 4096 Jan 19 14:20 sample_data -rw-r--r-- 1 root root 81101949 Jan 15 2023 survey_results_public.csv
In [22]:
!ls -tl language50_have/
total 4 -rw-r--r-- 1 root root 0 Jan 24 06:34 _SUCCESS -rw-r--r-- 1 root root 447 Jan 24 06:34 part-00000-5d5bebc8-2e53-4d1c-9e5d-4bcc09a24e96-c000.csv
In [24]:
!cat language50_have/part-00000-5d5bebc8-2e53-4d1c-9e5d-4bcc09a24e96-c000.csv
JavaScript,53587 HTML/CSS,46259 Python,39792 SQL,38835 Java,29162 Node.js,27975 TypeScript,24909 C#,22984 Bash/Shell,22385 C++,20057 PHP,18130 C,17329 PowerShell,8871 Go,7879 Kotlin,6866 Rust,5799 Ruby,5569 Dart,4965 Assembly,4632 Swift,4204 R,4185 VBA,3847 Matlab,3846 Groovy,2479 Objective-C,2310 Scala,2148 Perl,2028 Haskell,1749 Delphi,1731 Clojure,1552 Elixir,1438 LISP,1096 NA,1082 Julia,1068 F#,804 Erlang,651 APL,536 Crystal,466 COBOL,437
가장 배우고 싶은 언어들 찾기
In [26]:
# language_want 컬럼을 펼쳐서 배열의 각 요소를 별도의 행으로 만들어주는 작업을 수행.
'''F.explode(df3.language_want).alias("language_want"): language_want 컬럼을 펼쳐서
배열의 각 요소를 별도의 행으로 만들어주는 작업을 수행합니다. explode 함수는 배열을
풀어서 각 요소를 새로운 행으로 만들어주는 역할을 합니다. alias 함수를 사용하여 새로
생성된 컬럼의 이름을 "language_want"로 지정합니다.'''
df_language_want = df3.select(
df3.ResponseId,
F.explode(df3.language_want).alias("language_want")
)
In [27]:
df_language_want.show(5)
+----------+-------------+ |ResponseId|language_want| +----------+-------------+ | 1| Swift| | 2| NA| | 3| Julia| | 3| Python| | 3| Rust| +----------+-------------+ only showing top 5 rows
In [28]:
df_language_want.groupby("language_want").count().show(10)
+-------------+-----+ |language_want|count| +-------------+-----+ | C#|17999| | VBA| 1069| | Rust|15865| | Bash/Shell|14043| | JavaScript|37008| | NA| 6618| | Perl| 1175| | Erlang| 1379| | Matlab| 1562| | Crystal| 790| +-------------+-----+ only showing top 10 rows
In [29]:
df_language50_want = df_language_want.groupby("language_want").count().orderBy('count', ascending=False).limit(50)
In [30]:
df_language50_want.show(10)
+-------------+-----+ |language_want|count| +-------------+-----+ | JavaScript|37008| | Python|34929| | HTML/CSS|29353| | TypeScript|26905| | SQL|26631| | Node.js|24100| | C#|17999| | Java|17222| | Rust|15865| | Go|15788| +-------------+-----+ only showing top 10 rows
In [31]:
df_language50_want.write.mode('overwrite').csv("language50_want")
In [32]:
!ls -tl language50_want/
total 4 -rw-r--r-- 1 root root 0 Jan 24 06:52 _SUCCESS -rw-r--r-- 1 root root 449 Jan 24 06:52 part-00000-d3568d32-b1f8-4aeb-b5ee-faf76a35364e-c000.csv