Data Cleansing

Done to remove outliers, influential data points, and high-leverage points.

Outlier = a case where the predicted value (from the fitted line) and the actual value differ so much that the residual is abnormally large.

Outlier Detection

Analyze → Regression → Linear → Save button → tick the following:

Distances

Residuals

Influence Statistics

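Cook’s Distance (measures influence)

Should stay below 1, and the smaller the better. A case with Cook’s distance above 1 is usually treated as influential.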
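To see what SPSS is computing here, a minimal sketch of Cook’s distance for a regression with a single predictor, using only the Python standard library. The data are hypothetical, the function name `cooks_distance` is made up for illustration, and the cutoff of 1 is the usual rule of thumb rather than a fixed standard.

```python
from statistics import mean

def cooks_distance(x, y):
    """Cook's distance for each case in a one-predictor linear regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept + one slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # Residual = actual value minus value predicted by the fitted line
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage (hat value) of each case in simple regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]

# Hypothetical data: the last case sits far from the others and off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 10.0]

distances = cooks_distance(x, y)
# Assumed rule of thumb: Cook's distance above 1 marks an influential case.
flagged = [i for i, d in enumerate(distances) if d > 1]
print(flagged)
```

The formula combines the size of the residual with the leverage of the case, which is why a point can be influential only if it is both unusual in its predictor value and poorly fitted.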

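Mahalanobis Distance (measures leverage → explains outliers)

Measures how far a case's predictor values lie from the centroid (the means) of all predictors, in SD-like units that take the correlations among predictors into account. Used for multiple regression.

Test it with `SIG.CHISQ( calculated MAH, Degree of freedom)`, which returns the p-value (significance) of the distance.

The distance should not exceed the chi-square critical value at df = number of predictor variables (do not subtract 1). It counts as significant at p < 0.001.

If it is significant, the case is an outlier.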
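The check above can be sketched outside SPSS as follows. This is an illustration under assumptions, not SPSS's own code: it handles exactly two predictors (so the covariance matrix can be inverted by hand), the data are hypothetical, and `chi2_sf` is a hand-written upper-tail chi-square probability that plays the role `SIG.CHISQ` has in the note above.

```python
import math
from statistics import mean

def chi2_sf(x, df):
    """Upper-tail chi-square probability P(X > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # closed form for df = 1
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1.0
    return q

def mahalanobis_sq_2d(x1, x2):
    """Squared Mahalanobis distance of each case from the predictor centroid."""
    n = len(x1)
    m1, m2 = mean(x1), mean(x2)
    # Sample covariance matrix entries (divisor n - 1)
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    det = s11 * s22 - s12 ** 2
    # d' S^-1 d, with the 2x2 inverse written out explicitly
    return [
        (s22 * (a - m1) ** 2
         - 2 * s12 * (a - m1) * (b - m2)
         + s11 * (b - m2) ** 2) / det
        for a, b in zip(x1, x2)
    ]

# Hypothetical data: a 6 x 4 grid of ordinary cases plus one extreme case.
x1 = [a for b in range(1, 5) for a in range(1, 7)] + [30]
x2 = [b for b in range(1, 5) for a in range(1, 7)] + [30]

d2 = mahalanobis_sq_2d(x1, x2)
p_values = [chi2_sf(d, df=2) for d in d2]   # df = number of predictors
flagged = [i for i, p in enumerate(p_values) if p < 0.001]
print(flagged)
```

With two predictors the critical value at p = 0.001 is about 13.816, so comparing the distance with that value and comparing the p-value with 0.001 give the same decision.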

Leverage Value / Centered Leverage Value / Average Leverage (measures leverage → explains outliers)

Should be no more than 2 or 3 times (k+1)/n (either multiplier is fine, but most people use 2)

where k = number of predictors and n = number of cases
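A small sketch of the leverage rule for the one-predictor case, where the hat value has a simple closed form. The data and function names are hypothetical. Note that SPSS, as far as I understand it, saves the centered leverage, which is the hat value below minus 1/n, so the cutoff may need the same adjustment when comparing against SPSS output.

```python
from statistics import mean

def leverage_values(x):
    """Hat values h_i for a regression with an intercept and one predictor."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def leverage_cutoff(k, n, multiplier=2):
    """Rule-of-thumb cutoff: multiplier * (k + 1) / n."""
    return multiplier * (k + 1) / n

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]        # hypothetical predictor values
h = leverage_values(x)
cutoff = leverage_cutoff(k=1, n=len(x))     # 2 * (1 + 1) / 10
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]
print(cutoff, high_leverage)
```

The hat values always sum to k + 1, so (k+1)/n is their average; the rule flags cases whose leverage is two (or three) times that average.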

Unstandardized Residuals

Standard Deviation (measures outliers)

Analyze → Descriptive Statistics → Descriptives → select the variables and tick "Save standardized values as variables" → click OK
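The saved standardized values are z-scores. A minimal sketch of the same calculation, with hypothetical data; the cutoff of |z| > 3 is a common convention assumed here, not something stated in these notes.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values using the sample mean and sample SD (divisor n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical scores: nineteen ordinary values and one extreme value.
scores = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 48, 52, 50, 49, 51, 50, 50, 50, 120]

z = z_scores(scores)
# Assumed cutoff: cases more than 3 SD from the mean are treated as outliers.
outliers = [i for i, zi in enumerate(z) if abs(zi) > 3]
print(outliers)
```

With very small samples no case can reach |z| > 3 (the largest possible value is (n-1)/√n), so this screen only makes sense once n is around 11 or more.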